Multiple Clustering Views via Constrained Projections

Authors

  • Xuan-Hong Dang
  • Ira Assent
  • James Bailey
Abstract

It is well known that off-the-shelf clustering methods may discover different patterns in the same data set, because each algorithm carries its own bias from the criterion it optimizes. Furthermore, there is no ground truth against which a clustering result can be validated, so no cross-validation technique can be used to tune the input parameters involved. As a consequence, the user has no guidelines for choosing the proper clustering method for a given data set.

Clustering ensembles have emerged as a technique for overcoming these problems. A clustering ensemble consists of different clusterings obtained from multiple applications of a single algorithm with different initializations, from various bootstrap samples of the available data, or from the application of different algorithms to the same data set. Ensembles address challenges that stem from the ill-posed nature of clustering: they can provide more robust and stable solutions by exploiting the consensus across multiple clustering results, while averaging out spurious structures that arise from the biases of the individual algorithms or from the variance induced by different data samples.

Another issue related to clustering is the so-called curse of dimensionality. Data with thousands of dimensions abound in fields as diverse as bioinformatics, security and intrusion detection, and information and image retrieval. Clustering algorithms can handle low-dimensional data, but they tend to break down as the dimensionality increases, because in high-dimensional spaces data become extremely sparse and points lie far apart from each other. A common scenario with high-dimensional data is that several clusters exist in different subspaces, each composed of a different combination of features.
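As an illustration of the ensemble idea described in this abstract, the following is a minimal sketch, not the authors' method. It assumes a plain k-means base clusterer run with different random initializations, and it assumes a co-association matrix with a 0.5 threshold as the consensus rule. The function names `kmeans`, `co_association`, and `consensus_labels` are hypothetical and were chosen for this example only.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    # Fancy indexing copies the rows, so updating centers leaves X untouched.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels

def co_association(X, k, n_runs=20, seed=0):
    """Fraction of runs in which each pair of points shares a cluster."""
    rng = np.random.default_rng(seed)
    n = len(X)
    C = np.zeros((n, n))
    for _ in range(n_runs):
        labels = kmeans(X, k, rng)
        C += labels[:, None] == labels[None, :]
    return C / n_runs

def consensus_labels(C, threshold=0.5):
    """Connected components of the graph linking pairs with C > threshold."""
    n = len(C)
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero((C[i] > threshold) & (labels == -1)):
                labels[j] = current
                stack.append(j)
        current += 1
    return labels

# Toy data (an assumption for the demo): two well-separated 2-D blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
               rng.normal(5.0, 0.3, size=(30, 2))])
C = co_association(X, k=2)
consensus = consensus_labels(C)
```

Thresholding the co-association matrix is only one of several possible consensus functions. It is used here because it makes the averaging-out effect easy to see: a pair of points is linked only if most base clusterings agree on it, so a single poorly initialized run cannot change the result.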
In many real-world problems, points in a given region of the input space may cluster along a given set of dimensions, while points located in another region may form a tight group with respect to different dimensions. Each dimension could be relevant to at least one of the clusters. Common global dimensionality reduction techniques are unable to capture such local structure in the data, so a proper feature selection procedure should operate locally in the input space. Local feature selection allows one to estimate the degree to which each feature participates in the discovery of clusters. As a result, many different subspace clustering methods have been proposed.

∗ Department of Computer Science, George Mason University

Traditionally, clustering ensembles and subspace clustering have been developed independently of one another. Clustering ensembles address the ill-posed nature of clustering, but in general do not address the curse of dimensionality. Subspace clustering avoids the curse of dimensionality in high-dimensional spaces, but typically requires setting critical input parameters whose values are unknown. To overcome these limitations, we have introduced a unified framework that handles both issues: the ill-posed nature of clustering and the curse of dimensionality. Addressing the two together is nontrivial, as it involves solving a new problem altogether: the subspace clustering ensemble problem. Our approach takes two different perspectives: in one, we model the problem as a multi- and single-objective optimization problem [3, 2, 1]; in the other, we take a generative view and assume that the base clusterings are generated from a hidden consensus clustering of the data [5, 4]. Both directions are promising and lead to interesting challenges. The first can yield general and efficient solutions, but requires the number of clusters in the consensus clustering as input.
The second has higher complexity, but provides a principled solution to the “How many clusters?” question. In this talk, I focus on the first approach. I introduce the formal definition of the subspace clustering ensemble problem and heuristics to solve it. The objective is to define methods that exploit the information provided by an ensemble of subspace clustering solutions to compute a robust consensus subspace clustering. The problem is formulated as a multi- and single-objective optimization problem in which the objective functions embed both sides of the ensemble components: the data clusterings and the assignments of features to clusters.
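To make the notion of assigning features to clusters more concrete, here is a small sketch under stated assumptions. It is not the formulation used in the talk. It assumes that a feature's relevance to a cluster can be scored by an exponential function of the within-cluster dispersion along that feature, normalized so that each cluster's weights sum to 1. The function name `local_feature_weights` and the bandwidth parameter `h` are hypothetical.

```python
import numpy as np

def local_feature_weights(X, labels, h=1.0):
    """Per-cluster feature weights: low dispersion along a feature gives a high weight.

    The exponential weighting and the bandwidth h are illustrative assumptions.
    """
    weights = {}
    for c in np.unique(labels):
        members = X[labels == c]
        # Mean squared deviation from the cluster centroid, per dimension.
        disp = ((members - members.mean(axis=0)) ** 2).mean(axis=0)
        w = np.exp(-disp / h)
        weights[int(c)] = w / w.sum()  # normalize so the weights sum to 1
    return weights

# Toy data (an assumption for the demo): cluster 0 is tight along feature 0,
# cluster 1 is tight along feature 1.
rng = np.random.default_rng(0)
A = np.column_stack([rng.normal(0.0, 0.1, 50), rng.normal(0.0, 2.0, 50)])
B = np.column_stack([rng.normal(5.0, 2.0, 50), rng.normal(5.0, 0.1, 50)])
X = np.vstack([A, B])
labels = np.array([0] * 50 + [1] * 50)

W = local_feature_weights(X, labels)
```

In this toy setting each cluster puts most of its weight on the feature along which it is tight. A global dimensionality reduction would miss this, because neither feature is irrelevant to the data set as a whole.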

Similar resources

Multi-view clustering via pairwise sparse subspace representation

Multi-view clustering, which aims to cluster datasets with multiple sources of information, has a wide range of applications in the communities of data mining and pattern recognition. Generally, it makes use of the complementary information embedded in multiple views to improve clustering performance. Recent methods usually find a low-dimensional embedding of multi-view data, but often ignore s...

Multiple Clustering Views from Multiple Uncertain Experts

Expert input can improve clustering performance. In today’s collaborative environment, the availability of crowdsourced multiple expert input is becoming common. Given multiple experts’ inputs, most existing approaches can only discover one clustering structure. However, data is multi-faceted by nature and can be clustered in different ways (also known as views). In an exploratory analysis prob...

Repeated Record Ordering for Constrained Size Clustering

One of the main techniques used in data mining is data clustering, which has many applications in computer science, biology, and social sciences. Constrained clustering is a type of clustering in which side information provided by the user is incorporated into current clustering algorithms. One of the well researched constrained clustering algorithms is called microaggregation. In a microaggreg...

Multi-View Clustering via Joint Nonnegative Matrix Factorization

Many real-world datasets are comprised of different representations or views which often provide information complementary to each other. To integrate information from multiple views in the unsupervised setting, multiview clustering algorithms have been developed to cluster multiple views simultaneously to derive a solution which uncovers the common latent structure shared by multiple views. In...

ASCLU: Alternative Subspace Clustering

Finding groups of similar objects in databases is one of the most important data mining tasks. Recently, traditional clustering approaches have been extended to generate alternative clustering solutions. The basic observation is that for each database object multiple meaningful groupings might exist: the data allows to be clustered through different perspectives. It is thus reasonable to search...

Subspace clustering for complex data

Clustering is an established data mining technique for grouping objects based on their mutual similarity. Since in today’s applications, however, usually many characteristics for each object are recorded, one cannot expect to find similar objects by considering all attributes together. In contrast, valuable clusters are hidden in subspace projections of the data. As a general solution to this p...

Publication date: 2012